Introduction

The semester project gives you the chance to apply the knowledge and skills you will learn in the course in a way that mimics the ways you can expect to use logistic regression modeling in a real-world setting.

For the project, you will choose one data set from the two listed below and perform a logistic regression analysis. Specifically, you will build a regression model and report on the model statistics and diagnostics. Your final deliverable for the project will be an R Markdown file. The file will contain the analyses detailed in each step with 5-7 written bullet points. Basically, pretend you are presenting to a non-technical audience, and the bullet points would serve as a script outline for your explanation. By non-technical I mean an audience who knows things like “a p-value less than 0.05 mean statistical significance” but cannot really explain the underlying concepts in great detail.

You can choose one of two different data sets to complete the project, either the wine quality data set first analyzed during week 3 or the data set diabetes from the faraway package, which we do not cover elsewhere in the course. Here are the details for each. Remember, you just need to choose one, not both.

Load Libraries

library(tidyverse)
library(ROCR) # <-- for AUC 
library(ResourceSelection) # <-- for Hosmer-Lemeshow Goodness of Fit Test
library(splines) # <-- for splines 

Read and Prepare Dataset

I am chosing the wine quality dataset for my analysis.

wine_red <- read_delim('winequality-red.csv',delim = ';') %>% 
                    mutate(quality=factor(ifelse(quality>5,1,0),labels = c('low','high')))
names(wine_red)<-make.names(names(wine_red),unique = TRUE)

Project Steps:

The project consists of five steps, which you will work on throughout the second half of the semester. The steps you complete each week will accumulate in your R Markdown file, due at the end of the course. You will receive credit for completing drafts of Steps 1- 5 according to the schedule below. Each of these steps are described in greater detail in the week they are due.

Step 1: Exploratory Data Analysis (EDA)

  • Task: Construct interleaved histograms and scatterplots to explore the relation between the predictors and response. Specifically, choose two predictors and make an interleaved histogram and scatterplot for each. Thus, you should construct four total graphs.

  • Deliverables:
    • Your code in Markdown should produce the plots.
    • Your 5-7 bullet points should explain the graphs including the axes and what each plot means. Remember, interleaved histograms may be easy for you but maybe not so for others.

Alcohol Percentage

ggplot(
  data = wine_red,
  aes(x=alcohol,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
  ggtitle("Histogram of Wine Quality by Alcohol Percent")+
  xlab("Alcohol %")

Total Sulfur Dioxide

ggplot(
  data = wine_red,
  aes(x=total.sulfur.dioxide,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
  ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
  xlab("Total Sulfur Dioxide")

Scatterplot of Alcohol Percentage and Total Sulfur Dioxide

ggplot(
  data = wine_red,
  aes(x=total.sulfur.dioxide,y=alcohol,color=quality)
) + geom_point(alpha=.5) +
  ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
  xlab("Total Sulfur Dioxide")+
  ylab("Alcohol %")

Graph Explanations

  • Higher Alcohol Percentage is associated with better quality wines
  • Lower Total Sulfur Dioxide is associated with better quality wines

Step 2: Variable Selection

  • Task: Perform variable selection via the AIC using the step() function. Your starting model should include all available predictors. The reduced model should be used as your final model for all subsequent steps. Or, if you disagree with the recommended model, you need to indicate why.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give a brief explanation of the algorithm and comment on the variables retained/removed from the model. Are the results intuitive?

Create Binary Response Variable

wine_red <- wine_red %>% 
               mutate(quality=ifelse(quality=='high',1,0))

Create Saturated Model

saturated_model <- glm(quality~.,family=binomial,data=wine_red)

Use step() to select paired down model

aic_step <- step(saturated_model)
## Start:  AIC=1679.63
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Deviance    AIC
## - pH                    1   1655.9 1677.9
## - density               1   1656.0 1678.0
## - residual.sugar        1   1656.7 1678.7
## - fixed.acidity         1   1657.5 1679.5
## <none>                      1655.6 1679.6
## - citric.acid           1   1660.8 1682.8
## - chlorides             1   1662.1 1684.1
## - free.sulfur.dioxide   1   1662.9 1684.9
## - total.sulfur.dioxide  1   1690.2 1712.2
## - sulphates             1   1698.8 1720.8
## - volatile.acidity      1   1705.9 1727.9
## - alcohol               1   1730.0 1752.0
## 
## Step:  AIC=1677.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + sulphates + alcohol
## 
##                        Df Deviance    AIC
## - density               1   1657.2 1677.2
## - residual.sugar        1   1657.5 1677.5
## <none>                      1655.9 1677.9
## - citric.acid           1   1661.2 1681.2
## - chlorides             1   1662.1 1682.1
## - fixed.acidity         1   1662.8 1682.8
## - free.sulfur.dioxide   1   1663.0 1683.0
## - total.sulfur.dioxide  1   1690.7 1710.7
## - sulphates             1   1701.1 1721.1
## - volatile.acidity      1   1706.8 1726.8
## - alcohol               1   1751.2 1771.2
## 
## Step:  AIC=1677.19
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     sulphates + alcohol
## 
##                        Df Deviance    AIC
## - residual.sugar        1   1657.8 1675.8
## <none>                      1657.2 1677.2
## - citric.acid           1   1662.5 1680.5
## - chlorides             1   1663.1 1681.1
## - fixed.acidity         1   1663.3 1681.3
## - free.sulfur.dioxide   1   1664.2 1682.2
## - total.sulfur.dioxide  1   1691.4 1709.4
## - sulphates             1   1701.1 1719.1
## - volatile.acidity      1   1713.3 1731.3
## - alcohol               1   1839.5 1857.5
## 
## Step:  AIC=1675.84
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
##     alcohol
## 
##                        Df Deviance    AIC
## <none>                      1657.8 1675.8
## - citric.acid           1   1663.0 1679.0
## - chlorides             1   1663.5 1679.5
## - fixed.acidity         1   1664.2 1680.2
## - free.sulfur.dioxide   1   1665.2 1681.2
## - total.sulfur.dioxide  1   1691.4 1707.4
## - sulphates             1   1701.2 1717.2
## - volatile.acidity      1   1713.4 1729.4
## - alcohol               1   1843.8 1859.8

Final Model Selected by step()

formula(aic_step)
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
##     alcohol

Comments on Variables Selected:

  • How the algorithm works:
    • Step starts with the saturated model and decides iteratively which variables it can remove
    • At each step it computes the AIC for a model with each single predictor removed as well as one for no predictors removed
    • If the model with a predictor removed gives a better AIC than one with no predictor removed step removes the predictor that gives the lowest AIC model it repeats this process until the model reaches a number of predictors where removing no predictor gives the lowest AIC
  • Results:
    • The results show that pH, residual sugar and density are not good predictors of quality with red wine.
    • Also during the step process the removal of pH caused the fixed acidity variable to contribute more and become a relevant predictor since those variables are highly related.
    • Both of the variables I selected during my EDA section were relevant predictors and remained in the model

Step 3: Assess Model Fit

  • Task: Check that continuous predictors have a linear relation with the logit using loess plots and perform the Hosmer-Lemeshow (HL) goodness of fit test. If the loess plot is nonlinear, then splines should be used to account for the nonlinearity.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should explain the axes of the loess plot, the conclusion of the loess plot, the basic concepts of the HL test, and its result. If the HL test reveals poor fit, this should be discussed.

Loess Plot

loess_plot <- function(column_name){
  f <- as.formula(paste("quality", column_name, sep = " ~ "))
  y_smooth <- predict(loess(f, data=wine_red))
  zero_one <- which(y_smooth>0 & y_smooth<1)

  plot(jitter(wine_red[[column_name]])[zero_one],log(y_smooth[zero_one]/(1-y_smooth[zero_one])),
       xlab = column_name)
}

Alcohol

loess_plot('alcohol')

The lowess plot for Alcohol clearly shows around 2 splines around 9.5% and 12.5%

Volatile Acidity

loess_plot('volatile.acidity')

The lowess plot for Volatile Acidity does not show a strong need for splines

Sulphates

loess_plot('sulphates')

The lowess plot for Sulphates shows a clear need for a spline around 1

Total Sulfur Dioxide

loess_plot('total.sulfur.dioxide')

The lowess plot for Total Sulfur Dioxide shows a potential spline around 40 is needed

Free Sulfur Dioxide

loess_plot('free.sulfur.dioxide')

The lowess plot for Free Sulfur Dioxide shows splines around 10 and 15 could improve the model

Fixed Acidity

loess_plot('fixed.acidity')

The lowess plot for Fixed Acidity shows that splines around 8 and 12 are needed

Chlorides

loess_plot('chlorides')

The lowess plot for Chlorides shows a spline around .09 should be added

Citric Acid

loess_plot('citric.acid')

The lowess plot for Citric Acid shows that splines around .2 and .5 should be added

Adding Splines

spline_formula <- formula(quality~ns(citric.acid,4)+
                                  ns(chlorides,2)+
                                  ns(fixed.acidity,2)+
                                  ns(free.sulfur.dioxide,4)+
                                  ns(total.sulfur.dioxide,3)+
                                  ns(sulphates,4)+
                                  volatile.acidity+
                                  ns(alcohol,4))
step_with_splines_model <- glm(spline_formula,family=binomial,data=wine_red)

Hosmer-Lemeshow Goodness of Fit Test

Model without Splines

hl_test <- hoslem.test(wine_red$quality,aic_step$fitted.values,10)
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine_red$quality, aic_step$fitted.values
## X-squared = 16.205, df = 8, p-value = 0.03954

Model with Splines

hl_test <- hoslem.test(wine_red$quality,step_with_splines_model$fitted.values,10)
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine_red$quality, step_with_splines_model$fitted.values
## X-squared = 10.983, df = 8, p-value = 0.2026

Comments on Loess Plot and HL GoF Test:

  • Most of the Loess Plots show that the predictors are nonlinear except Volatile Acidity and thus splines need to be used to create multiple regression lines for each of the remaining predictors in the model. Another choice could be to select a model like a decision tree or neural network that can model features with nonlinear behaviors

  • HL Tests
    • The HL test on the model fit with the step() function shows that the model fit is not adequate. This is likely due to the fact that the predictors are mostly non-linear.
    • After adding splines the model fit shows that it is adequate

Step 4: Model Inferences

  • Task: Report p-values and confidence intervals for significant predictors and check for influential observations. Any influential observations should be removed and the model should be refit. Note any changes in the inferences due to the removal.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give practical interpretation of the inferences and explain the process of checking for influential observations.
summary(step_with_splines_model)
## 
## Call:
## glm(formula = spline_formula, family = binomial, data = wine_red)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5819  -0.8058   0.2813   0.7730   2.4647  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -0.78309    1.28210  -0.611  0.54134    
## ns(citric.acid, 4)1          -0.18255    0.30052  -0.607  0.54356    
## ns(citric.acid, 4)2          -1.45620    0.51768  -2.813  0.00491 ** 
## ns(citric.acid, 4)3          -0.83535    0.80108  -1.043  0.29705    
## ns(citric.acid, 4)4           0.69174    1.38556   0.499  0.61760    
## ns(chlorides, 2)1            -1.83924    1.01853  -1.806  0.07095 .  
## ns(chlorides, 2)2            -1.73198    1.39238  -1.244  0.21354    
## ns(fixed.acidity, 2)1         3.07964    0.79158   3.891  0.00010 ***
## ns(fixed.acidity, 2)2        -0.15121    0.67374  -0.224  0.82242    
## ns(free.sulfur.dioxide, 4)1   0.58704    0.55610   1.056  0.29114    
## ns(free.sulfur.dioxide, 4)2  -0.02082    0.56275  -0.037  0.97048    
## ns(free.sulfur.dioxide, 4)3   0.58175    1.23333   0.472  0.63715    
## ns(free.sulfur.dioxide, 4)4   1.89145    1.10156   1.717  0.08597 .  
## ns(total.sulfur.dioxide, 3)1 -2.35477    0.56903  -4.138 3.50e-05 ***
## ns(total.sulfur.dioxide, 3)2 -1.34006    1.07490  -1.247  0.21252    
## ns(total.sulfur.dioxide, 3)3 -1.74398    1.43544  -1.215  0.22439    
## ns(sulphates, 4)1             2.59607    0.79839   3.252  0.00115 ** 
## ns(sulphates, 4)2             2.67761    0.68466   3.911 9.20e-05 ***
## ns(sulphates, 4)3             4.35350    1.83747   2.369  0.01782 *  
## ns(sulphates, 4)4             2.52875    1.09639   2.306  0.02109 *  
## volatile.acidity             -3.41227    0.51351  -6.645 3.03e-11 ***
## ns(alcohol, 4)1              -0.40314    0.68396  -0.589  0.55558    
## ns(alcohol, 4)2               2.62487    0.65818   3.988 6.66e-05 ***
## ns(alcohol, 4)3               1.03219    1.61410   0.639  0.52251    
## ns(alcohol, 4)4               3.54187    1.34294   2.637  0.00835 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2209.0  on 1598  degrees of freedom
## Residual deviance: 1583.7  on 1574  degrees of freedom
## AIC: 1633.7
## 
## Number of Fisher Scoring iterations: 5

Comments on Inferences and Influential Observations:

5: Asses Predictive Power

  • Task: Summarize predictive/discriminatory power of the model with an ROC curve and plots of predicted probabilities.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give a brief explanation of each plot and the interpretation for your model.

Plot ROC Curve

pred <- prediction(step_with_splines_model$fitted.values, wine_red$quality)
perf <- performance(pred, measure = "tpr", x.measure = "fpr") 
plot(perf, col=rainbow(10), main="ROC Curve")

Determine AUC

auc.tmp <- performance(pred,"auc"); 
auc <- as.numeric(auc.tmp@y.values)
cat(paste("The AUC for my selected model is:",round(auc,3)))
## The AUC for my selected model is: 0.839

Plot Predicted Probabilities by Predictor

plot(wine_red$fixed.acidity,step_with_splines_model$fitted.values)

plot(wine_red$volatile.acidity,step_with_splines_model$fitted.values)

plot(wine_red$citric.acid,step_with_splines_model$fitted.values)

plot(wine_red$chlorides,step_with_splines_model$fitted.values)

plot(wine_red$free.sulfur.dioxide,step_with_splines_model$fitted.values)

plot(wine_red$total.sulfur.dioxide,step_with_splines_model$fitted.values)

plot(wine_red$sulphates,step_with_splines_model$fitted.values)

plot(wine_red$alcohol,step_with_splines_model$fitted.values)

Comments on Plots and Model Interpretation: